iclr 2020
Review for NeurIPS paper: ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding
I could not see a strong motivation for explicitly enforcing sparsity on architecture parameters. This is because there are already many works trying to decouple the dependency of evaluating sub-networks on the training of supernet (i.e., making the correlation higher). This means that we have ways to explicitly decouple the network evaluation with supernet training without adding a sparsity regularizaiton. As far as I know, weight-sharing methods require the BN to be re-calculated [1] to properly measure the Kendall correlation. Other works that can reduce the gap between supernet and sub-networks (e.g.
Sharing Knowledge in Multi-Task Deep Reinforcement Learning
D'Eramo, Carlo, Tateo, Davide, Bonarini, Andrea, Restelli, Marcello, Peters, Jan
We study the benefit of sharing representations among tasks to enable the effective use of deep neural networks in Multi-Task Reinforcement Learning. We leverage the assumption that learning from different tasks, sharing common properties, is helpful to generalize the knowledge of them resulting in a more effective feature extraction compared to learning a single task. Intuitively, the resulting set of features offers performance benefits when used by Reinforcement Learning algorithms. We prove this by providing theoretical guarantees that highlight the conditions for which is convenient to share representations among tasks, extending the wellknown finite-time bounds of Approximate Value-Iteration to the multi-task setting. In addition, we complement our analysis by proposing multi-task extensions of three Reinforcement Learning algorithms that we empirically evaluate on widely used Reinforcement Learning benchmarks showing significant improvements over the single-task counterparts in terms of sample efficiency and performance. Multi-Task Learning (MTL) ambitiously aims to learn multiple tasks jointly instead of learning them separately, leveraging the assumption that the considered tasks have common properties which can be exploited by Machine Learning (ML) models to generalize the learning of each of them. For instance, the features extracted in the hidden layers of a neural network trained on multiple tasks have the advantage of being a general representation of structures common to each other.
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Concrete Problems in AI Safety, Revisited
Raji, Inioluwa Deborah, Dobbe, Roel
As AI systems proliferate in society, the AI community is increasingly preoccupied with the concept of AI Safety, namely the prevention of failures due to accidents that arise from an unanticipated departure of a system's behavior from designer intent in AI deployment. We demonstrate through an analysis of real world cases of such incidents that although current vocabulary captures a range of the encountered issues of AI deployment, an expanded socio-technical framing will be required for a more complete understanding of how AI systems and implemented safety mechanisms fail and succeed in real life. The rapid adoption and widespread experimentation and deployment of AI systems has triggered a variety of failures. Some are catastrophic and visible such as in the case of fatal crashes involving autonomous vehicles. Other failures are much more subtle and pernicious, such as the development of new forms of addiction to personalized content.
- North America > United States > Texas (0.05)
- North America > United States > New York > Richmond County > New York City (0.04)
- North America > United States > New York > Queens County > New York City (0.04)
- (5 more...)
- Transportation > Ground > Road (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.95)
- (2 more...)
Principled Weight Initialization for Hypernetworks
Chang, Oscar, Flokas, Lampros, Lipson, Hod
Hypernetworks are meta neural networks that generate weights for a main neural network in an end-to-end differentiable manner. Despite extensive applications ranging from multi-task learning to Bayesian deep learning, the problem of optimizing hypernetworks has not been studied to date. We observe that classical weight initialization methods like Glorot & Bengio (2010) and He et al. (2015), when applied directly on a hypernet, fail to produce weights for the mainnet in the correct scale. We develop principled techniques for weight initialization in hypernets, and show that they lead to more stable mainnet weights, lower training loss, and faster convergence. Meta-learning describes a broad family of techniques in machine learning that deals with the problem of learning to learn. An emerging branch of meta-learning involves the use of hypernetworks, which are meta neural networks that generate the weights of a main neural network to solve a given task in an end-to-end differentiable manner. Hypernetworks were originally introduced by Ha et al. (2016) as a way to induce weight-sharing and achieve model compression by training the same meta network to learn the weights belonging to different layers in the main network.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Africa > Kenya (0.04)
Accelerating Meta-Learning by Sharing Gradients
The success of gradient-based meta-learning is primarily attributed to its ability to leverage related tasks to learn task-invariant information. However, the absence of interactions between different tasks in the inner loop leads to task-specific over-fitting in the initial phase of meta-training. While this is eventually corrected by the presence of these interactions in the outer loop, it comes at a significant cost of slower meta-learning. To address this limitation, we explicitly encode task relatedness via an inner loop regularization mechanism inspired by multi-task learning. Our algorithm shares gradient information from previously encountered tasks as well as concurrent tasks in the same task batch, and scales their contribution with meta-learned parameters. We show using two popular few-shot classification datasets that gradient sharing enables meta-learning under bigger inner loop learning rates and can accelerate the meta-training process by up to 134%.
Spanish Pre-trained BERT Model and Evaluation Data
Cañete, José, Chaperon, Gabriel, Fuentes, Rodrigo, Ho, Jou-Hui, Kang, Hojin, Pérez, Jorge
The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pretrained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks. The field of natural language processing (NLP) has made incredible progress in the last two years.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Chile (0.05)
- South America > Paraguay > Asunción > Asunción (0.04)
- (2 more...)
Improved Image Wasserstein Attacks and Defenses
Hu, Edward J., Swaminathan, Adith, Salman, Hadi, Yang, Greg
A recently proposed Wasserstein distance-bounded threat model is a promising alternative that limits the perturbation to pixel mass movements. We point out and rectify flaws in the previous definition of the Wasserstein threat model and explore stronger attacks and defenses under our better-defined framework. Lastly, we discuss the inability of current Wasserstein-robust models in defending against perturbations seen in the real world. We will release our code and trained models upon publication. Deep learning approaches to computer vision tasks, such as image classification, are not robust. For example, a data point that is classified correctly can be modified in a nearly imperceptible way to cause the classifier to misclassify it (Szegedy et al., 2013; Goodfellow et al., 2015).
What can human minimal videos tell us about dynamic recognition models?
Ben-Yosef, Guy, Kreiman, Gabriel, Ullman, Shimon
Published as a workshop paper at "Bridging AI and Cognitive Science" (ICLR 2020) In human vision objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that human visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are short and tiny video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. State-of-the-art deep networks for dynamic visual recognition cannot replicate human behavior in these configurations. This gap between humans and machines points to critical mechanisms in human dynamic vision that are lacking in current models.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Middle East > Israel (0.04)